In [ ]:
import sklearn
You are able to import this because the module sklearn is already part of the Anaconda distribution.
You can explore the modules that are part of sklearn by typing from sklearn import and then pressing Tab.
In [ ]:
# try it below
from sklearn import
# this also works with submodules
from sklearn.linear_model import
In [ ]:
# from the submodule linear_model, let's import LinearRegression
from sklearn.linear_model import LinearRegression
Python is based on object-oriented programming (OOP). The imported LinearRegression is a class definition. You can find the parent classes of a class by inspecting its __bases__ attribute
In [ ]:
LinearRegression.__bases__
To create an object, you call the class with parameters. To retrieve the possible parameters of a class (or function) in the notebook, you can press Shift-Tab (preview), Shift-Tab twice (expanded window), three times (expanded window with no timeout), or four times (split view of the help)
In [ ]:
# try it below
LinearRegression()
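If you prefer a programmatic alternative to Shift-Tab, the same signature can be retrieved with the standard library's inspect module (inspect.signature is a standard Python function; this is just a sketch of an alternative workflow)
In [ ]:
# show the constructor signature of the class
import inspect
inspect.signature(LinearRegression)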
Now, let's create a linear regression object
In [ ]:
lr = LinearRegression()
Again, we can explore that object by typing the name of the object, then ., and then pressing Tab
In [ ]:
# try it here
lr.
If we type lr into the notebook, we will get a customized description of the object
In [ ]:
lr
We can obtain the class of an object programmatically by calling the built-in type function
In [ ]:
type(lr)
Also, every object has a unique identity, which we can retrieve with the built-in id function
In [ ]:
id(lr)
In [ ]:
from sklearn.datasets import load_diabetes
In [ ]:
diabetes_ds = load_diabetes()
In [ ]:
X = diabetes_ds['data']
y = diabetes_ds['target']
sklearn works mostly with numpy arrays, which are $n$-dimensional arrays.
In [ ]:
[type(X), type(y)]
You can check the number of dimensions of an array
In [ ]:
X.ndim
Check the size of the dimensions
In [ ]:
X.shape
Get slices of the dimensions. The following are all the same thing: grab the first two rows of a matrix
In [ ]:
X[0:2]
In [ ]:
X[:2]
In [ ]:
X[0:2, :]
We can also grab columns in the same way
In [ ]:
X[:, 0:2]
Sometimes you want to grab just one column (feature), but then numpy returns a one-dimensional object
In [ ]:
X[:, 2].shape
We can reshape the $n$-dimensional array and add one dimension:
In [ ]:
X[:, 2].reshape([-1, 1])
In [ ]:
X[:, 2].reshape([-1, 1]).shape
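An equivalent idiom (assuming numpy is imported as np, as is conventional) uses np.newaxis to add the extra dimension
In [ ]:
import numpy as np
# slicing with np.newaxis adds a dimension of size 1
X[:, 2][:, np.newaxis].shape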
You can do matrix algebra:
In [ ]:
# transpose
X.T.shape
In [ ]:
X.dot(X.T).shape
For more functions, you can import numpy's linear algebra module
In [ ]:
import numpy.linalg as la
In [ ]:
la.inv(X.dot(X.T)).shape
OK, let's go back to our example with linear regression.
Usually, sklearn objects start by fitting the data and then either predict or transform new data. Predicting is usually for supervised learning and transforming for unsupervised learning.
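As a small sketch of the transform side of this pattern, here is sklearn's StandardScaler (a real sklearn transformer; the variable names are only illustrative)
In [ ]:
from sklearn.preprocessing import StandardScaler
# transformers follow the same fit-then-transform pattern
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)
X_scaled.mean(axis=0)  # each column is now approximately zero-mean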
In [ ]:
# explore the parameters of fit
lr.fit
In [ ]:
lr2 = lr.fit(X[:, [2]], y)
fit
returns an object. If we examine the id of the object it returns:
In [ ]:
id(lr2)
In [ ]:
id(lr)
We realize that it is the same object lr: the call fits the data, modifies the internal structure of the object, and returns the object itself.
Therefore, you can chain calls, which is a very powerful feature.
By looking at the online documentation of LinearRegression, we can find out the parameters it learned.
In [ ]:
lr.intercept_
In [ ]:
lr.coef_
In [ ]:
# explore the parameters
lr.predict
In [ ]:
y_pred = lr.predict(X[:, [2]])
Because we know how linear regression works, we can produce the predictions ourselves
In [ ]:
y_pred2 = lr.intercept_ + X[:, [2]].dot(lr.coef_)
In [ ]:
import numpy as np
# this checks that all entries in the comparison are True
np.all(y_pred2 == y_pred)
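Since these are floating-point numbers, exact equality can be fragile in general; np.allclose (a real numpy function) is the usual tolerant comparison
In [ ]:
# compare with a small numerical tolerance
np.allclose(y_pred2, y_pred)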
Now, due to the powerful concept of chaining, we can combine fit and predict in one line
In [ ]:
y_pred3 = lr.fit(X[:, [2]], y).predict(X[:, [2]])
In [ ]:
np.all(y_pred3 == y_pred)
Sometimes you want to use a package that you found online. Many of these packages are available through pip, the Python package installer.
For example, the package quandl allows quants to load financial data in Python.
We can install it in the console simply by typing
pip install quandl
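You can also run the same command from inside the notebook by prefixing it with !, Jupyter's shell escape
In [ ]:
!pip install quandl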
And now we should be able to import that package
In [ ]:
import quandl
In [ ]:
import quandl
mydata = quandl.get("YAHOO/AAPL")
In [ ]:
mydata.head()
In [ ]:
# this displays plot results inline in the notebook
%matplotlib inline
Pandas is a package for loading, manipulating, and displaying data sets. It tries to mimic the functionality of data.frame in R
In [ ]:
import pandas as pd
Many packages return data in pandas
DataFrame
objects
In [ ]:
apple_stocks = quandl.get("YAHOO/AAPL")
In [ ]:
type(apple_stocks)
We can display the beginning of a data frame:
In [ ]:
apple_stocks.head()
In [ ]:
apple_stocks.tail()
We can also plot it with pandas
In [ ]:
apple_stocks.plot(y='Close');
We can manipulate it too. Let's say we want to compute the stock returns
$$ r_t = \frac{V_t - V_{t-1}}{V_{t-1}} = \frac{V_t}{V_{t-1}} - 1$$
For this, we need a rolling computation over the series; pandas provides pct_change for exactly this
In [ ]:
apple_stocks[['Close']].pct_change().head()
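We can double-check that pct_change matches the return formula above by computing it manually with shift, which moves the series by one time step (shift is a real pandas method; this is just an illustrative check)
In [ ]:
close = apple_stocks['Close']
# (V_t - V_{t-1}) / V_{t-1}, computed by hand
((close - close.shift(1)) / close.shift(1)).head()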
In [ ]:
apple_stocks[['Close']].pct_change().plot();
In [ ]:
apple_stocks[['Close']].pct_change().hist(bins=100);
Spark is a distributed in-memory big data analytics framework. It is Hadoop on steroids.
Because we launched this jupyter notebook with pyspark, we automatically have available a variable called sc, the Spark context, which gives us access to the master and therefore to the workers.
If we open the Spark dashboard (usually on port 4040), we can see some of this runtime information.
With the Spark context you can read data from many sources, including HDFS (Hadoop Distributed File System), Hive, Amazon S3, local files, and databases.
In [ ]:
# explore the variables and functions available in the Spark context
sc
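For example, sc.textFile reads a text file into an RDD (textFile is a real SparkContext method; the path below is only illustrative, and the call is lazy, so nothing is read until an action runs)
In [ ]:
# illustrative path; replace with a real HDFS or local file
lines_rdd = sc.textFile('hdfs:///tmp/example.txt')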
Spark usually works with RDDs (Resilient Distributed Datasets) and, more recently, is moving towards DataFrames, which are similar to pandas data frames but distributed.
In [ ]:
rdd_example = sc.parallelize([1, 2, 3, 4, 5, 6, 7])
We can check the id
of the RDD
in the cluster
In [ ]:
rdd_example.id()
In [ ]:
# this is an RDD
type(rdd_example)
Let's explore the functions we have available
In [ ]:
rdd_example.
One such function is take, which allows you to get a taste of what the RDD contains
In [ ]:
rdd_example.take(3)
Let's say you want to apply an operation to each element of the list
In [ ]:
def square(x):
return x**2
Now we can apply that transformation to the RDD with the map function
In [ ]:
rdd_result = rdd_example.map(square)
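Equivalently, map also accepts an anonymous function
In [ ]:
# the same transformation written with a lambda
rdd_example.map(lambda x: x**2)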
You might notice that this returns immediately. This is because operations on RDDs are lazily evaluated
In [ ]:
type(rdd_result)
So rdd_result
is another RDD
In [ ]:
rdd_result.id()
In fact, there is no duplication of data: Spark builds a computational graph that keeps track of dependencies and recomputes them if something crashes.
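You can peek at this lineage with toDebugString (a real RDD method in pyspark)
In [ ]:
# description of the RDD and its recursive dependencies
rdd_result.toDebugString()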
We can take a look at the contents of the results by using take
again. Since take
is an action, it will trigger a job in the Spark cluster
In [ ]:
rdd_result.take(3)
In [ ]:
rdd_result.count()
In [ ]:
rdd_result.first()
Usually, once you have your results, you write them back to Hadoop for later processing, because they usually won't fit in memory.
In [ ]:
# this function can save into HDFS using Pickle (Python's internal) format
# saveAsPickleFile requires a destination path; the one below is only illustrative
rdd_result.saveAsPickleFile('rdd_result_pickle')
Now, DataFrames have some structure. Again, you can create them from different sources. In this case, DataFrame functionality is available from another context called the sqlContext, which gives us access to SQL-like transformations.
In this example, we will use the sklearn
diabetes dataset again
In [ ]:
from sklearn.datasets import load_diabetes
import pandas as pd
In [ ]:
diabetes_ds = load_diabetes()
To create a dataset useful for machine learning, we need to use certain data types
In [ ]:
from pyspark.mllib.regression import LabeledPoint
In [ ]:
from pyspark.ml.linalg import Vectors
In [ ]:
Xy_df = sqlContext.createDataFrame(
    [[float(l), Vectors.dense(d)] for d, l in zip(diabetes_ds['data'], diabetes_ds['target'])],
    ["y", "features"])
In [ ]:
Xy_df
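Since DataFrames are also lazily evaluated, displaying the variable only shows its schema; to see actual rows you can use show (a real DataFrame method)
In [ ]:
Xy_df.show(5)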
We can register the DataFrame in Spark as a SQL table
In [ ]:
Xy_df.registerTempTable('Xy')
And then run queries
In [ ]:
sql_result1_df = sqlContext.sql('select count(*) from Xy')
In [ ]:
# which again is lazily executed
sql_result1_df
In [ ]:
sql_result1_df.take(1)
We can again run large-scale regression, this time using DataFrames
In [ ]:
from pyspark.ml.regression import LinearRegression
In [ ]:
lr_spark = LinearRegression(featuresCol='features', labelCol="y")
In [ ]:
# this fails: coefficients exist only on the fitted model returned by fit
lr_spark.coefficients
In [ ]:
lr_results = lr_spark.fit(Xy_df)
In [ ]:
lr_results.coefficients
In [ ]:
lr_results.intercept
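Finally, as with sklearn, the fitted model can generate predictions. In Spark ML this is done with transform (a real method of the fitted model), which appends a prediction column to the DataFrame
In [ ]:
predictions_df = lr_results.transform(Xy_df)
predictions_df.show(5)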